In [1]:

    
from ggplot import *
import pandas as pd
import numpy as np



In [ ]:

    
%matplotlib inline



In [3]:

    
df = pd.read_csv("./baseball-pitches-clean.csv")
df = df[['pitch_time', 'inning', 'pitcher_name', 'hitter_name', 'pitch_type', 
         'px', 'pz', 'pitch_name', 'start_speed', 'end_speed', 'type_confidence']]
df.head()









    Out[3]:

                      pitch_time  inning       pitcher_name    hitter_name  pitch_type      px     pz  pitch_name  start_speed  end_speed  type_confidence
    0  2013-10-01 20:07:43 -0400       1  Francisco Liriano  Shin-Soo Choo           B   0.628  1.547    Fastball         93.2       85.3            0.894
    1  2013-10-01 20:07:57 -0400       1  Francisco Liriano  Shin-Soo Choo           S   0.545  3.069    Fastball         93.4       85.6            0.895
    2  2013-10-01 20:08:12 -0400       1  Francisco Liriano  Shin-Soo Choo           S   0.120  1.826      Slider         89.1       82.8            0.931
    3  2013-10-01 20:08:31 -0400       1  Francisco Liriano  Shin-Soo Choo           S  -0.229  1.667      Slider         90.0       83.3            0.926
    4  2013-10-01 20:09:09 -0400       1  Francisco Liriano   Ryan Ludwick           B  -1.917  0.438      Slider         87.7       81.6            0.915

5 rows × 11 columns

Getting a feel for what's going on

`geom_point`

I usually start by making some really simple plots like scatterplots and histograms just to make sure that things make sense.

px and pz are the coordinates of a pitch as it crosses home plate. Let's plot these and see if our data makes sense.



In [3]:

    
ggplot(df, aes(x='px', y='pz')) + geom_point()









    












    Out[3]:





<ggplot: (272839901)>

What about the pitch speed?



In [4]:

    
ggplot(aes(x='start_speed', y='end_speed'), data=df) + geom_point()









    












    Out[4]:





<ggplot: (276734237)>
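
The scatterplot is a visual sanity check; the same check can be sketched numerically. This is a minimal plain-Python sketch using only the five rows shown by `df.head()`, and it assumes `start_speed` is measured near release and `end_speed` near the plate, so every pitch should lose speed along the way.

```python
# (start_speed, end_speed) pairs copied from the five rows shown by df.head()
pairs = [(93.2, 85.3), (93.4, 85.6), (89.1, 82.8), (90.0, 83.3), (87.7, 81.6)]

# Assumption: a pitch should be slower at the end of its flight than at the start
all_slower = all(end < start for start, end in pairs)

# Speed lost in flight for each pitch, rounded to one decimal place
drops = [round(start - end, 1) for start, end in pairs]

print(all_slower)
print(drops)
```

If any row failed this check in the full data set, it would be worth a closer look before plotting further.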

`geom_histogram`

A better way to inspect pitch speed might be to look at a distribution of the data.

Does this make sense? Let's consult the source of all true wisdom: https://answers.yahoo.com/question/index?qid=20080126131031AAwVCNk



In [4]:

    
ggplot(df, aes(x='start_speed')) + geom_histogram()









    



stat_bin: binwidth defaulted to range/30.
    Use 'binwidth = x' to adjust this.






    












    Out[4]:





<ggplot: (285457305)>
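
The `stat_bin` message above says the bin width defaulted to range/30. As a rough illustration of what that default works out to, here is a plain-Python sketch using only the five `start_speed` values shown by `df.head()`. The helper name `default_binwidth` is hypothetical and is not part of ggplot.

```python
# start_speed values copied from the five rows shown by df.head()
speeds = [93.2, 93.4, 89.1, 90.0, 87.7]


def default_binwidth(values, bins=30):
    """Hypothetical helper: bin width when the data range is split into `bins` equal parts."""
    return (max(values) - min(values)) / bins


width = default_binwidth(speeds)
print("range: %.1f, binwidth: %.2f" % (max(speeds) - min(speeds), width))
```

On the full data set the range is wider, so the default bins are wider too; the message suggests setting `binwidth` yourself when the default is too coarse or too fine.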

What about for specific pitches?



In [5]:

    
for name, frame in df.groupby("pitch_name"):
    print(ggplot(aes(x='start_speed'), data=frame) + geom_histogram() + ggtitle("Distribution of " + str(name)))

<ggplot: (288278377)>
<ggplot: (285224941)>
<ggplot: (288277409)>
<ggplot: (289871437)>
<ggplot: (293071473)>
<ggplot: (292574941)>
<ggplot: (289870441)>
<ggplot: (291709497)>
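
The loop above relies on `groupby` to hand back one sub-frame per pitch name. As a rough sketch of that splitting step, here is the same idea in plain Python on the five rows shown by `df.head()`; it is an illustration of the grouping logic, not how pandas implements it.

```python
from collections import defaultdict

# (pitch_name, start_speed) pairs copied from the five rows shown by df.head()
rows = [
    ("Fastball", 93.2),
    ("Fastball", 93.4),
    ("Slider", 89.1),
    ("Slider", 90.0),
    ("Slider", 87.7),
]

# Collect the speeds for each pitch name, one list per group
groups = defaultdict(list)
for pitch_name, start_speed in rows:
    groups[pitch_name].append(start_speed)

# Visit the groups in sorted key order, which is also the pandas groupby default
for name in sorted(groups):
    speeds = groups[name]
    print("%s: n=%d, min=%.1f, max=%.1f" % (name, len(speeds), min(speeds), max(speeds)))
```

Each group here would become one histogram in the loop, which is why the number of plots grows with the number of distinct pitch names.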

That was helpful, but I'm sort of on plot overload now.

`facet_wrap` FTW

Use the trellis.

"Trellis Graphics is a family of techniques for viewing complex, multi-variable data sets." Read more here.



In [6]:

    
ggplot(aes(x='start_speed'), data=df) +\
    geom_histogram() +\
    facet_wrap('pitch_name')









    



/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ggplot-0.5.9-py2.7.egg/ggplot/ggplot.py:198: RuntimeWarning: Facetting is currently not supported with geom_bar. See
                    https://github.com/yhat/ggplot/issues/196 for more information
  warnings.warn(msg, RuntimeWarning)






    












    Out[6]:





<ggplot: (292575897)>

Changeup, Curveball, Cut Fastball, Eephus... Wait, what?

http://en.wikipedia.org/wiki/Eephus_pitch



In [15]:

    
from IPython.display import YouTubeVideo
YouTubeVideo("ikLlRT2j7EQ")









    Out[15]:

OK, so what about balls and strikes?



In [8]:

    
ggplot(aes(x='pitch_type'), data=df) + geom_bar()









    












    Out[8]:





<ggplot: (275730281)>
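
With only an `x` mapping, the bar chart draws one bar per distinct `pitch_type` value, and the bar height is the number of rows with that value. A minimal sketch of that counting step, using only the five rows shown by `df.head()` (the full data set may contain other values as well):

```python
from collections import Counter

# pitch_type values copied from the five rows shown by df.head()
pitch_types = ["B", "S", "S", "S", "B"]

# Count how often each value occurs; these counts play the role of bar heights
counts = Counter(pitch_types)

for pitch_type, n in sorted(counts.items()):
    print("%s: %d" % (pitch_type, n))
```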

`facet_grid`

(facet_wrap's brother)



In [9]:

    
ggplot(aes(x='start_speed'), data=df) +\
    geom_histogram() +\
    facet_grid('pitch_type')









    












    Out[9]:





<ggplot: (276653609)>



In [12]:

    
ggplot(aes(x='start_speed'), data=df) +\
    geom_histogram() +\
    facet_grid('pitch_name', 'pitch_type', scales="free")









    












    Out[12]:





<ggplot: (271338625)>

`geom_density`

Similar to geom_histogram, but the y axis shows a relative density instead of raw counts.



In [13]:

    
ggplot(df, aes(x='start_speed')) +\
    geom_density()









    












    Out[13]:





<ggplot: (275662825)>
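
To make the "relative y scale" concrete: a density curve is scaled so that the area under it is 1, whatever the number of pitches. Below is a minimal sketch of a kernel density estimate in plain Python, assuming a Gaussian kernel and an arbitrary bandwidth of 1.0. The helper `make_kde` is hypothetical and is not ggplot's implementation; the data are the five `start_speed` values shown by `df.head()`.

```python
import math

# start_speed values copied from the five rows shown by df.head()
speeds = [93.2, 93.4, 89.1, 90.0, 87.7]


def make_kde(values, bandwidth):
    """Hypothetical helper: return a function evaluating a Gaussian kernel density estimate."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Sum one Gaussian bump per observation, then normalise
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm

    return density


# bandwidth=1.0 is an arbitrary choice for illustration
density = make_kde(speeds, bandwidth=1.0)

# Approximate the area under the curve with a Riemann sum over 70..110
step = 0.01
area = sum(density(70 + i * step) for i in range(4000)) * step
print("area under the density curve: %.3f" % area)
```

Because every curve has the same total area, groups of very different sizes end up on a comparable scale, which helps when several densities share one plot.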



In [14]:

    
ggplot(df, aes(x='start_speed', color='pitch_name')) +\
    geom_density()









    












    Out[14]:





<ggplot: (278182857)>



In [ ]:
